EDA On Data Science Job Salaries Dataset¶
Table of Content¶
I. Notebook Objectives¶
The objective of this notebook is to analyze and investigate salary trends in the field of data science across different positions and geographical locations worldwide. By examining salary data obtained from ai-jobs.net, we aim to gain insights into the factors influencing salary variations and understand how data analysis techniques can be utilized to extract meaningful insights from the dataset.¶
Specific Objectives:¶
- Understand how to analyze data to derive meaningful insights and trends.¶
- Investigate salary distributions and variations across different positions within the field of data science.¶
- Explore geographical differences in salary trends for data science positions around the world.¶
- Utilize data visualization techniques to present findings and enhance understanding.¶
Note: This notebook is intended for educational purposes and skill improvement. It should not be considered as formal research or a professional paper.¶
II. What is Data Science¶
Data science is a field that uses a mix of methods, algorithms, and tools to understand and make sense of different types of data. It combines ideas from statistics, math, computer science, and specific areas of expertise to analyze and interpret complex data sets. The data science lifecycle begins with business understanding, followed by data mining, data cleaning, exploration, advanced analytics, and insights visualization.¶
Data science is also pivotal for modern organizations, facilitating informed decision-making through data-driven insights and predictive analytics. It empowers businesses to anticipate trends and risks, driving proactive strategies. By transforming raw data into actionable insights, data science enhances business intelligence, fostering a deeper understanding of operations, customers, and markets for improved competitiveness.¶
Due to their significant value to organizations, data science positions are characterized by high demand and high compensation. In light of this, we aim to conduct an in-depth analysis of data science salaries spanning the years 2021 to 2024, to examine trends and patterns through comprehensive data analysis and visualization techniques¶
III. Import library¶
In [1]:
# Import image
from IPython.display import Image
# Data analysis
import pandas as pd
pd.plotting.register_matplotlib_converters()
import numpy as np
# Visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
import datetime as dt
import warnings
import plotly.graph_objects as go
IV. Summaries Dataset¶
In [2]:
df=pd.read_csv('/Users/thienphuquach/Desktop/Research Paper/ds_salaries.csv')
df.head(10)
Out[2]:
| Unnamed: 0 | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020 | MI | FT | Data Scientist | 70000 | EUR | 79833 | DE | 0 | DE | L |
| 1 | 1 | 2020 | SE | FT | Machine Learning Scientist | 260000 | USD | 260000 | JP | 0 | JP | S |
| 2 | 2 | 2020 | SE | FT | Big Data Engineer | 85000 | GBP | 109024 | GB | 50 | GB | M |
| 3 | 3 | 2020 | MI | FT | Product Data Analyst | 20000 | USD | 20000 | HN | 0 | HN | S |
| 4 | 4 | 2020 | SE | FT | Machine Learning Engineer | 150000 | USD | 150000 | US | 50 | US | L |
| 5 | 5 | 2020 | EN | FT | Data Analyst | 72000 | USD | 72000 | US | 100 | US | L |
| 6 | 6 | 2020 | SE | FT | Lead Data Scientist | 190000 | USD | 190000 | US | 100 | US | S |
| 7 | 7 | 2020 | MI | FT | Data Scientist | 11000000 | HUF | 35735 | HU | 50 | HU | L |
| 8 | 8 | 2020 | MI | FT | Business Data Analyst | 135000 | USD | 135000 | US | 100 | US | L |
| 9 | 9 | 2020 | SE | FT | Lead Data Engineer | 125000 | USD | 125000 | NZ | 50 | NZ | S |
work_year: The year the salary was paid.¶
experience_level: Experience level at the job during the year¶
employment_type: Type of employment (PT-Part-time, FT-Full-time, CT-Contract, FL-Freelance)¶
job_title: Job role during the year¶
salary: Total gross salary amount¶
salary_currency: Currency of salary, paid as an ISO 4217 currency code¶
salary_in_usd: Salary in USD¶
employee_residence: Employee's primary country of residence, as an ISO 3166 country code¶
remote_ratio: Overall amount of work done remotely¶
company_location: Country of employer's main office or contracting branch, as an ISO 3166 country code¶
company_size: Amount of people worked for the company¶
In [3]:
# Checking the type of each
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 607 entries, 0 to 606 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 607 non-null int64 1 work_year 607 non-null int64 2 experience_level 607 non-null object 3 employment_type 607 non-null object 4 job_title 607 non-null object 5 salary 607 non-null int64 6 salary_currency 607 non-null object 7 salary_in_usd 607 non-null int64 8 employee_residence 607 non-null object 9 remote_ratio 607 non-null int64 10 company_location 607 non-null object 11 company_size 607 non-null object dtypes: int64(5), object(7) memory usage: 57.0+ KB
In [4]:
# Convert the "remote_ratio" column to text
df['remote_ratio'] = df['remote_ratio'].astype(str)
# Verify the changes
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 607 entries, 0 to 606 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 607 non-null int64 1 work_year 607 non-null int64 2 experience_level 607 non-null object 3 employment_type 607 non-null object 4 job_title 607 non-null object 5 salary 607 non-null int64 6 salary_currency 607 non-null object 7 salary_in_usd 607 non-null int64 8 employee_residence 607 non-null object 9 remote_ratio 607 non-null object 10 company_location 607 non-null object 11 company_size 607 non-null object dtypes: int64(4), object(8) memory usage: 57.0+ KB
In [5]:
df['remote_ratio'].replace({'0':'No remote','50':'Partially remote','100':' Fully remote'},inplace=True)
df['company_size'].replace({'L':'Large','S':'Small','M':'Medium'},inplace=True)
df=df.drop('Unnamed: 0',axis=1)
df
Out[5]:
| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | MI | FT | Data Scientist | 70000 | EUR | 79833 | DE | No remote | DE | Large |
| 1 | 2020 | SE | FT | Machine Learning Scientist | 260000 | USD | 260000 | JP | No remote | JP | Small |
| 2 | 2020 | SE | FT | Big Data Engineer | 85000 | GBP | 109024 | GB | Partially remote | GB | Medium |
| 3 | 2020 | MI | FT | Product Data Analyst | 20000 | USD | 20000 | HN | No remote | HN | Small |
| 4 | 2020 | SE | FT | Machine Learning Engineer | 150000 | USD | 150000 | US | Partially remote | US | Large |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 602 | 2022 | SE | FT | Data Engineer | 154000 | USD | 154000 | US | Fully remote | US | Medium |
| 603 | 2022 | SE | FT | Data Engineer | 126000 | USD | 126000 | US | Fully remote | US | Medium |
| 604 | 2022 | SE | FT | Data Analyst | 129000 | USD | 129000 | US | No remote | US | Medium |
| 605 | 2022 | SE | FT | Data Analyst | 150000 | USD | 150000 | US | Fully remote | US | Medium |
| 606 | 2022 | MI | FT | AI Scientist | 200000 | USD | 200000 | IN | Fully remote | US | Large |
607 rows × 11 columns
In [6]:
# Replace Acronym
df['experience_level'].replace({'EN':'Entry Level','MI':'Intermediate','EX':'Expert','SE':'Senior'},inplace=True)
df['employment_type'].replace({'PT':'Part-Time','FT':'Full-Time','CT':'Contract','FL':'Freelance'},inplace=True)
df
Out[6]:
| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Intermediate | Full-Time | Data Scientist | 70000 | EUR | 79833 | DE | No remote | DE | Large |
| 1 | 2020 | Senior | Full-Time | Machine Learning Scientist | 260000 | USD | 260000 | JP | No remote | JP | Small |
| 2 | 2020 | Senior | Full-Time | Big Data Engineer | 85000 | GBP | 109024 | GB | Partially remote | GB | Medium |
| 3 | 2020 | Intermediate | Full-Time | Product Data Analyst | 20000 | USD | 20000 | HN | No remote | HN | Small |
| 4 | 2020 | Senior | Full-Time | Machine Learning Engineer | 150000 | USD | 150000 | US | Partially remote | US | Large |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 602 | 2022 | Senior | Full-Time | Data Engineer | 154000 | USD | 154000 | US | Fully remote | US | Medium |
| 603 | 2022 | Senior | Full-Time | Data Engineer | 126000 | USD | 126000 | US | Fully remote | US | Medium |
| 604 | 2022 | Senior | Full-Time | Data Analyst | 129000 | USD | 129000 | US | No remote | US | Medium |
| 605 | 2022 | Senior | Full-Time | Data Analyst | 150000 | USD | 150000 | US | Fully remote | US | Medium |
| 606 | 2022 | Intermediate | Full-Time | AI Scientist | 200000 | USD | 200000 | IN | Fully remote | US | Large |
607 rows × 11 columns
In [7]:
#Checking for null values
df.isnull().sum()
Out[7]:
work_year 0 experience_level 0 employment_type 0 job_title 0 salary 0 salary_currency 0 salary_in_usd 0 employee_residence 0 remote_ratio 0 company_location 0 company_size 0 dtype: int64
V. Data Analysis¶
1) Experience Level¶
In [8]:
# Group the DataFrame by 'experience_level' and calculate the count of job listings
experience_counts = df.groupby('experience_level', as_index=False)['salary_in_usd'].count().sort_values(by='salary_in_usd', ascending=False)
# Create the pie chart using Plotly Express
fig = px.pie(
experience_counts,
names='experience_level',
values='salary_in_usd',
color='experience_level',
hole=0.4,
labels={'experience_level': 'Experience Level', 'salary_in_usd': 'Count'},
template='plotly_dark',
title='<b>Total Jobs Based on Experience Level'
)
# Update the layout to adjust title position and legend orientation
fig.update_layout(
title_x=0.5,
legend=dict(
orientation='h',
yanchor='bottom',
y=1.02,
xanchor='right',
x=0.8,
)
)
# Show the pie chart
fig.show()
In this dataset, as seen from the treemap above, it's evident that Senior and Intermediate levels together make up more than 80%, while entry and expert levels account for only 14.5% and 4.28%, respectively.¶
The predominance of Senior and Intermediate levels in this dataset may be attributed to several factors. Primarily, a significant proportion of individuals in these roles hold advanced degrees, such as Master's and PhDs, which are typically associated with positions beyond entry-level.¶
Drawing from the methodology employed by Rosanne Bollen in "Competencies of the Sexiest Job of the 21st Century: Developing and Testing a Measurement Instrument for Data Scientists," our analysis involves a quantitative cross-sectional questionnaire research to assess the validity of our revised competency framework (Bollen, Competencies of the Sexiest Job of the 21st Century: Developing and Testing a Measurement Instrument for Data Scientists 2017). Examining Sample 1, it is notable that 68.4% of respondents possess a Master's degree, while 6.6% have attained a PhD or other advanced degree, indicating a high level of educational attainment within the field. Additionally, age and experience play a significant role, with individuals aged 26-30, 31-35, and 36-40 collectively constituting over 70% of the sample.¶
2) Popular Role in Data Science¶
In [9]:
z=df['job_title'].value_counts().head(15)
fig=px.bar(z,x=z.index,
y=z.values,
color=z.index,
text=z.values,
labels={'index':'Job Title','y':'total','text':'count'},
template='plotly_dark',
title='<b> Top Popular Roles in Data Science')
fig.update_layout(title_x=0.5)
fig.show()
The demand for Data Scientists, Data Engineers, and Data Analysts is evidently high worldwide. Particularly in the United States, as indicated by research such as "A Data Science Approach to Defining a Data Scientist" by Andy Ho, data collected from Indeed in 16 major American cities revealed approximately 8,738 job postings for these roles. It's worth noting that this dataset does not encompass other hiring platforms like LinkedIn. Within this dataset, there were nearly 1,742 positions specifically for Data Scientists.(Ho et al., A Data Science Approach to Defining a Data Scientist 2019)¶
2) Salaries in Data Science field¶
In [10]:
fig=px.bar(df.groupby('job_title',as_index=False)['salary_in_usd'].max().sort_values(by='salary_in_usd',
ascending=False).head(10),
x='job_title',
y='salary_in_usd',
color='job_title',
labels={'job_title':'job title','salary_in_usd':'salary in usd'},
template='plotly_dark',
text='salary_in_usd',
title='<b> Top 10 Highest Paid Roles in Data Science')
fig.update_layout(title_x=0.5)
fig.show()
In [11]:
sns.scatterplot(x=df.index, y='salary_in_usd', data=df, color='orange', alpha=0.6, label='Salary')
sns.regplot(x=df.index, y='salary_in_usd', data=df, scatter=False, color='red', label='Regression line')
# Customize the plot
plt.title('Salary Distribution')
plt.xlabel('Person')
plt.ylabel('Salary (USD)')
plt.legend() # Show legend with labels
plt.grid(True) # Add gridlines
plt.tight_layout() # Adjust layout to prevent overlapping labels
plt.show()
4) Remote ratio¶
In [12]:
fig=px.pie(df.groupby('remote_ratio',as_index=False)['salary_in_usd'].count().sort_values(by='salary_in_usd',ascending=False),
names='remote_ratio',
values='salary_in_usd',
color='remote_ratio',
hole=0.5,
labels={'remote_ratio':'remote ratio','salary_in_usd':'count'},
template='plotly_dark',
title='<b> Remote Ratio')
fig.update_layout(
title_x=0.5,
legend=dict(
yanchor='bottom',
y=0.4,
xanchor='right',
x=0.8,
)
)
After the COVID-19 pandemic, many companies have embraced remote work, especially in the IT/Data field where it's often feasible. This shift has given organizations ample opportunities to redefine their programming teams' work environment. They can tailor everything from office layout ergonomics to upgrading infrastructure and providing better tools, all while enhancing flexibility in work hours and formats. Determining which factors truly influence team efficiency and effectiveness can be challenging. According to this dataset, approximately 62.8% of people are now working fully remotely, compared to just around 21% who continue to work on-site.(Rot et al., Programming teams in Remote Working Environments: An analysis of performance and productivity 2023)¶
In [13]:
Image(filename='/Users/thienphuquach/Desktop/Research Paper/DS/Remote ratio by experience.png')
Out[13]:
5) Company Size¶
### Figure 1: Remote ratio by experience level (Created by Phu Quach)
Despite the substantial prevalence of remote work arrangements, recent research, such as that conducted by Artur Rot, Malgorzata Sobinska and Peter Busch in "Programming Teams in Remote Working Environments: an Analysis of Performance and Productivity," highlights a noteworthy finding. Namely, it reveals the apparent absence or marginal negative impact of remote work on the operational efficiency of development teams.¶
However, junior developers manifest a distinct sentiment, characterized by a perceived constraint in articulating and resolving issues collaboratively, an activity traditionally facilitated through in-person interaction at the monitor or whiteboard. This phenomenon suggests a plausible link to their relative lack of experience and depth of domain-specific knowledge compared to their senior counterparts, thus potentially inhibiting their seamless integration into remote work paradigms.(Rot et al., Programming teams in Remote Working Environments: An analysis of performance and productivity 2023)¶
In [14]:
fig=px.pie(df.groupby('company_size',as_index=False)['salary_in_usd'].count().sort_values(by='salary_in_usd',ascending=False).head(10),
names='company_size',
values='salary_in_usd',
color='company_size',
hole=0.5,labels={'company_size':'Company Size','salary_in_usd':'count'},
template='plotly_dark',
title='<b> Company Sizes in Data Science Feild')
fig.update_layout(title_x=0.5,legend=dict(orientation='h',yanchor='bottom',y=1.02,xanchor='right',x=1))
In the contemporary digital landscape, data has risen to prominence as a pivotal asset for businesses, shaping decision-making processes, fostering innovation, and propelling growth. The deployment of data science solutions, entailing the extraction of valuable insights from extensive datasets, has yielded substantial impacts on company revenues.¶
In the contemporary landscape of e-commerce, the convergence of big data and Customer Relationship Management (CRM) has catalyzed a significant paradigm shift. This transformative integration is not exclusive to large tech conglomerates but also extends its potential to medium and small-sized enterprises. CRM, a strategic concept devised to bolster companies' profitability, empowers organizations to pinpoint and prioritize their most lucrative customer segments. Particularly within the realm of electronic commerce, where Electronic Customer Relationship Management (e-CRM) reigns supreme, the application of CRM principles emerges as a linchpin for success. This holds especially true for business-to-consumer entities and smaller firms, which, despite lacking the vast resources of corporate giants, are compelled to adeptly manage customer relationships to thrive in the digital marketplace. Contrary to conventional wisdom equating a company's tenure on the web with its electronic commerce prowess, the authors posit that success hinges more significantly on the adept execution of CRM strategies. Thus, irrespective of a company's size or longevity in the digital sphere, the effective deployment of CRM methodologies emerges as the cornerstone for sustained competitiveness and growth.¶
Example based on the analysis¶
For mid-cap companies, a deeper examination of e.l.f. Beauty provides illuminating insights(e.l.f. Beauty, Inc, Annual report 2023). In the fiscal year ending March 31, 2023, the company witnessed a remarkable surge in net sales, soaring by 186.6 million dollars, marking a staggering 48 percent increase from the previous fiscal year's 392.2 million to 578.8 million dollars . This substantial growth was primarily fueled by robust performance across both retail and e-commerce channels. Notably, net sales surged by 158.0 million dollars, a notable 45%, within the retailer channels, while e-commerce channels experienced a remarkable $28.6 million increase, representing an impressive 71% spike. This exemplifies the efficacy of data collection and analysis methodologies employed by smaller companies to optimize revenue streams. Moreover, e.l.f. Beauty underscores the pivotal role of its e-commerce operations within its business model. Their e-commerce platforms serve as integral extensions of their marketing strategies, acting as vital touchpoints for introducing prospective consumers to their brand, product offerings, and enriched content.However, amidst these opportunities lie inherent vulnerabilities. The reliance on e-commerce platforms exposes the company to risks such as website downtime and technical failures. A failure to swiftly address these challenges could potentially curtail e-commerce sales and tarnish the brand's reputation. Thus, it becomes imperative for companies like e.l.f. Beauty to proactively mitigate such risks to ensure sustained growth and brand resilience in the digital sphere.¶
For the leading entity in the e-commerce domain, Amazon unequivocally stands out as the preeminent corporation, boasting a colossal user base numbering in the billions. According to research on Big Data Techniques of Google, Amazon, Facebook, and Twitter, Amazon currently hosts over 300 million active users (Hewage et al., Review: Big data techniques of google, Amazon, Facebook and Twitter 2018). Within this expansive dataset lies a profound emphasis on Customer Relationship Management (CRM), underlining its critical significance within electronic commerce. Amazon's strategic utilization of big data techniques for CRM exemplifies a sophisticated approach to managing customer interactions and data, serving as a benchmark within the industry. With an intricate understanding of consumer behavior and preferences gleaned from extensive data analysis, Amazon adeptly tailors its services to individual users. This personalized approach manifests in targeted marketing initiatives, precise product recommendations, and refined customer segmentation strategies.Moreover, Amazon's mastery of big data extends beyond CRM, encompassing pivotal functions such as supply chain optimization, inventory management, fraud prevention, and logistical enhancements. By harnessing its robust infrastructure and cutting-edge analytics capabilities, Amazon processes vast data volumes in real-time, facilitating agile decision-making and fostering ongoing innovation.¶
In [15]:
Image(filename='/Users/thienphuquach/Desktop/Research Paper/DS/dashboard2.png')
Out[15]:
## Figure 1 : Tableau Dashboard Data Science Salary around the world by USD from 2020 to 2023 (Created by Phu Quach)
Following in-depth analysis, it's evident that data science has achieved global prominence, particularly in regions such as Europe, North America, Austria, and select areas in Asia. This widespread recognition underscores the escalating demand for expertise in data-related fields, indicating a substantial surge in their significance. This trend accentuates the growing prominence of data science across diverse industries and sectors.¶
VII. Import New Dataset¶
In [16]:
df2=pd.read_csv('/Users/thienphuquach/Desktop/salaries.csv')
df2
Out[16]:
| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | MI | FT | Prompt Engineer | 56000 | EUR | 60462 | DE | 0 | DE | S |
| 1 | 2024 | EN | FT | Research Engineer | 172800 | USD | 172800 | US | 0 | US | M |
| 2 | 2024 | EN | FT | Research Engineer | 144000 | USD | 144000 | US | 0 | US | M |
| 3 | 2024 | MI | FT | Data Engineer | 65000 | EUR | 72222 | AT | 0 | AT | M |
| 4 | 2024 | MI | FT | Data Engineer | 43000 | EUR | 47777 | AT | 0 | AT | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12977 | 2020 | SE | FT | Data Scientist | 412000 | USD | 412000 | US | 100 | US | L |
| 12978 | 2021 | MI | FT | Principal Data Scientist | 151000 | USD | 151000 | US | 100 | US | L |
| 12979 | 2020 | EN | FT | Data Scientist | 105000 | USD | 105000 | US | 100 | US | S |
| 12980 | 2020 | EN | CT | Business Data Analyst | 100000 | USD | 100000 | US | 100 | US | L |
| 12981 | 2021 | SE | FT | Data Science Manager | 7000000 | INR | 94665 | IN | 50 | IN | L |
12982 rows × 11 columns
In [17]:
df2['company_size'].replace({'L':'Large','S':'Small','M':'Medium'},inplace=True)
df2['experience_level'].replace({'EN':'Entry Level','MI':'Intermediate','EX':'Expert','SE':'Senior'},inplace=True)
df2['employment_type'].replace({'PT':'Part-Time','FT':'Full-Time','CT':'Contract','FL':'Freelance'},inplace=True)
df2['remote_ratio'] = df2['remote_ratio'].astype(str)
df2['remote_ratio'].replace({'0':'No remote','50':'Partially remote','100':' Fully remote'},inplace=True)
df2
Out[17]:
| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | Intermediate | Full-Time | Prompt Engineer | 56000 | EUR | 60462 | DE | No remote | DE | Small |
| 1 | 2024 | Entry Level | Full-Time | Research Engineer | 172800 | USD | 172800 | US | No remote | US | Medium |
| 2 | 2024 | Entry Level | Full-Time | Research Engineer | 144000 | USD | 144000 | US | No remote | US | Medium |
| 3 | 2024 | Intermediate | Full-Time | Data Engineer | 65000 | EUR | 72222 | AT | No remote | AT | Medium |
| 4 | 2024 | Intermediate | Full-Time | Data Engineer | 43000 | EUR | 47777 | AT | No remote | AT | Medium |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12977 | 2020 | Senior | Full-Time | Data Scientist | 412000 | USD | 412000 | US | Fully remote | US | Large |
| 12978 | 2021 | Intermediate | Full-Time | Principal Data Scientist | 151000 | USD | 151000 | US | Fully remote | US | Large |
| 12979 | 2020 | Entry Level | Full-Time | Data Scientist | 105000 | USD | 105000 | US | Fully remote | US | Small |
| 12980 | 2020 | Entry Level | Contract | Business Data Analyst | 100000 | USD | 100000 | US | Fully remote | US | Large |
| 12981 | 2021 | Senior | Full-Time | Data Science Manager | 7000000 | INR | 94665 | IN | Partially remote | IN | Large |
12982 rows × 11 columns
In [18]:
combined_df = pd.concat([df, df2], ignore_index=True)
combined_df
Out[18]:
| work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | remote_ratio | company_location | company_size | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Intermediate | Full-Time | Data Scientist | 70000 | EUR | 79833 | DE | No remote | DE | Large |
| 1 | 2020 | Senior | Full-Time | Machine Learning Scientist | 260000 | USD | 260000 | JP | No remote | JP | Small |
| 2 | 2020 | Senior | Full-Time | Big Data Engineer | 85000 | GBP | 109024 | GB | Partially remote | GB | Medium |
| 3 | 2020 | Intermediate | Full-Time | Product Data Analyst | 20000 | USD | 20000 | HN | No remote | HN | Small |
| 4 | 2020 | Senior | Full-Time | Machine Learning Engineer | 150000 | USD | 150000 | US | Partially remote | US | Large |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 13584 | 2020 | Senior | Full-Time | Data Scientist | 412000 | USD | 412000 | US | Fully remote | US | Large |
| 13585 | 2021 | Intermediate | Full-Time | Principal Data Scientist | 151000 | USD | 151000 | US | Fully remote | US | Large |
| 13586 | 2020 | Entry Level | Full-Time | Data Scientist | 105000 | USD | 105000 | US | Fully remote | US | Small |
| 13587 | 2020 | Entry Level | Contract | Business Data Analyst | 100000 | USD | 100000 | US | Fully remote | US | Large |
| 13588 | 2021 | Senior | Full-Time | Data Science Manager | 7000000 | INR | 94665 | IN | Partially remote | IN | Large |
13589 rows × 11 columns
In [19]:
px.violin(combined_df,
x='work_year',
y='salary_in_usd',
color='work_year',
labels={'work_year':'Year','salary_in_usd':'salary in usd'},
template='plotly_dark',
title='<b>Data Science Salaries by year')
In [20]:
Image(filename='/Users/thienphuquach/Desktop/Research Paper/DS/Dashboard1.PNG')
Out[20]:
Figure 2 : Tableau Dashboard Data Science Salary around the world by USD between 2023 and 2024 (Created by Phu Quach)